Explicit Versus Latent Concept Models for Cross-Language Information Retrieval

نویسندگان

  • Philipp Cimiano
  • Antje Schultz
  • Sergej Sizov
  • Philipp Sorg
  • Steffen Staab
چکیده

The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many approaches aim at a concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, latent topics derived from the data itself as in Latent Semantic Indexing (LSI) or (Latent Dirichlet Allocation (LDA) to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question which has not been answered so far is whether models based on explicitly given concepts (as in the ESA model for instance) perform inherently better than retrieval models based on “latent” concepts (as in LSI and/or LDA). In this paper we investigate this question closer in the context of a cross-language setting, which inherently requires concept-based retrieval bridging between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA) showing that the former is clearly superior to the both. From a general perspective, our results contribute to clarifying the role of explicit vs. implicitly derived or latent concepts in (crosslanguage) information retrieval research.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Explicit vs. Latent Concept Models for Cross-Language Information Retrieval

The field of information retrieval and text manipulation (classification, clustering) still strives for models allowing semantic information to be folded in to improve performance with respect to standard bag-of-word based models. Many approaches aim at a concept-based retrieval, but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such ...

متن کامل

Combining Wikipedia-Based Concept Models for Cross-Language Retrieval

As a low-cost ressource that is up-to-date, Wikipedia recently gains attention as a means to provide cross-language brigding for information retrieval. Contradictory to a previous study, we show that standard Latent Dirichlet Allocation (LDA) can extract cross-language information that is valuable for IR by simply normalizing the training data. Furthermore, we show that LDA and Explicit Semanti...

متن کامل

Learning Semantics with Deep Belief Network for Cross-Language Information Retrieval

This paper introduces a cross-language information retrieval (CLIR) framework that combines the state-of-the-art keyword-based approach with a latent semantic-based retrieval model. To capture and analyze the hidden semantics in cross-lingual settings, we construct latent semantic models that map text in different languages into a shared semantic space. Our proposed framework consists of deep b...

متن کامل

Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval

Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested int...

متن کامل

Cross-Language Information Retrieval with Latent Topic Models Trained on a Comparable Corpus

In this paper we study cross-language information retrieval using a bilingual topic model trained on comparable corpora such as Wikipedia articles. The bilingual Latent Dirichlet Allocation model (BiLDA) creates an interlingual representation, which can be used as a translation resource in many different multilingual settings as comparable corpora are available for many language pairs. The prob...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009